Lab Assignment One: Exploring Table Data¶

Course: Machine Learning in Python¶

Semester: Fall 2023¶

Authors:¶

Ritik Khandelwal (49347408)

Prashant Iyer (49352530)

Brian Ronald Mendes (49243148)

Team: Perceptron¶


Introduction¶

In this lab exercise, we would be analysing a bank marketing dataset for customers investing in deposit schemes. This would help the bank representatives understand and improve the strategizing procedure for customer selection.

Data Sources¶

Bank Marketing Dataset:

  1. UCI: (https://archive.ics.uci.edu/dataset/222/bank+marketing)
  2. Kaggle: (https://www.kaggle.com/datasets/janiobachmann/bank-marketing-dataset).

The research paper associated with this data can be found in the following journal:

Paper Citation: Moro, S., Cortez, P. & Rita, P. (2014). A data-driven approach to predict the success of bank telemarketing. Decision Support Systems. 62, 22-31 https://dx.doi.org/10.1016/j.dss.2014.03.001

1. Business Understanding¶

"Deposits" - The term that helps the bank function smoothly. Every bank tries to get their customer to invest in different deposit schemes in their bank. They give good competetive interest rates and offers to keep the customer tied to the schemes. They promote such schemes through various mediums like cellular, telephonic, in-person, emails etc. Every bank tries to analyse various customer factors like their credit history, balances, salary, education etc. when defining the best rates that can be offered.

There is a marketing team that looks after this part of business. They find various ways to communicate the benefits of the deposit scheme to the customers. Some invest and some don't. This Bank Marketing dataset tries to look at all these factors as a whole to predict if a customer would invest or not. Thus, pro-active measures can be taken in the future to on board the right set of customers.

The Bank Marketing Dataset has 11162 rows and 17 attributes that includes both numerical and categorical variables. It originally belongs to the UCI Machine Learning Repository, but can be found on Kaggle as well. We can download this dataset for free from either of the websites.

Analyzing and Visualizing this information can help strategize the right set of customers and their deposit deals and terms. It should definitely give a better result than when chossing any random customers. This is a great step ahead in order to save time and resources of a bank. It gives them the right direction to proceed with in the next deposit campaign cycle.

Problem Addressed:¶

Which next customer would deposit an amount in the bank? Which customers should be included in the bank deposit strategy?

In [1]:
#Libraries used
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
import warnings
warnings.filterwarnings("ignore")
import seaborn as sns
import umap
In [2]:
#Read the dataset
df = pd.read_csv('bank.csv')
df.shape
Out[2]:
(11162, 17)

2. Data Quality Checks¶

In [3]:
#categorizing between customer who made deposits and who did not
df.groupby('deposit')['deposit'].count()
Out[3]:
deposit
no     5873
yes    5289
Name: deposit, dtype: int64

Checks for deriving if the datset contains any missing rows or duplicates in the Dataset

In [4]:
#Check For Duplicates
df.duplicated().unique()
Out[4]:
array([False])
In [5]:
# Check for null/missing values
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 11162 entries, 0 to 11161
Data columns (total 17 columns):
 #   Column     Non-Null Count  Dtype 
---  ------     --------------  ----- 
 0   age        11162 non-null  int64 
 1   job        11162 non-null  object
 2   marital    11162 non-null  object
 3   education  11162 non-null  object
 4   default    11162 non-null  object
 5   balance    11162 non-null  int64 
 6   housing    11162 non-null  object
 7   loan       11162 non-null  object
 8   contact    11162 non-null  object
 9   day        11162 non-null  int64 
 10  month      11162 non-null  object
 11  duration   11162 non-null  int64 
 12  campaign   11162 non-null  int64 
 13  pdays      11162 non-null  int64 
 14  previous   11162 non-null  int64 
 15  poutcome   11162 non-null  object
 16  deposit    11162 non-null  object
dtypes: int64(7), object(10)
memory usage: 1.4+ MB

Refering to the above stats, we understand there aren't any duplicates values in the datasets as the duplicate check returns an array of False.

Thus, we need not eliminate any data from the dataset

When checked for the cases to check if there are any missing values, we observed there aren't Null values present in the dataset. However for some of the columns, we observed an entry called unknown which can be associated with missing in some sense. We will be now doing the further checks to decide if these entry have any effect on our dataset and how they can be handled

In [6]:
# Count the number of unknowns in every data column

unknown_count = {}

for col in df.columns:
    count = (df[col] == 'unknown').sum()
    unknown_count[col] = count
    
unknown_count_df = pd.DataFrame(list(unknown_count.items()),columns=['column','unknown_counts'])
unknown_count_df
Out[6]:
column unknown_counts
0 age 0
1 job 70
2 marital 0
3 education 497
4 default 0
5 balance 0
6 housing 0
7 loan 0
8 contact 2346
9 day 0
10 month 0
11 duration 0
12 campaign 0
13 pdays 0
14 previous 0
15 poutcome 8326
16 deposit 0

We can see that poutcome of the campaign is something that is not recorded for a pretty high number of dataset to come to a conclusion on imputation. Also, such high imputations might have negative statistical impact. Thus, we would be ignoring this column in the analysis going ahead.

Given, the other two columns (job, education & contact) are categorical fields, we would not be imputing any values to them and considering unknown as a separate entity thorugh our analysis.

3. Data Manipulations¶

In [7]:
#Renaming the columns
df.rename(columns =  {'housing':'housing_loan','loan':'personal_loan','default':'credit_default','contact':'contact_mode','pdays':'days_since_last_contact'},inplace=True)
In [8]:
#replace (yes,no) in deposit, housing_loan, personal_loan and credit default status with (0,1)

df['deposit'] = df.apply(lambda x: 0 if x['deposit'] == 'no' else 1,axis=1)
df['housing_loan'] = df.apply(lambda x: 0 if x['housing_loan'] == 'no' else 1,axis=1)
df['personal_loan'] = df.apply(lambda x: 0 if x['personal_loan'] == 'no' else 1,axis=1)
df['credit_default'] = df.apply(lambda x: 0 if x['credit_default'] == 'no' else 1,axis=1)

df.describe()
Out[8]:
age credit_default balance housing_loan personal_loan day duration campaign days_since_last_contact previous deposit
count 11162.000000 11162.000000 11162.000000 11162.000000 11162.000000 11162.000000 11162.000000 11162.000000 11162.000000 11162.000000 11162.000000
mean 41.231948 0.015051 1528.538524 0.473123 0.130801 15.658036 371.993818 2.508421 51.330407 0.832557 0.473840
std 11.913369 0.121761 3225.413326 0.499299 0.337198 8.420740 347.128386 2.722077 108.758282 2.292007 0.499338
min 18.000000 0.000000 -6847.000000 0.000000 0.000000 1.000000 2.000000 1.000000 -1.000000 0.000000 0.000000
25% 32.000000 0.000000 122.000000 0.000000 0.000000 8.000000 138.000000 1.000000 -1.000000 0.000000 0.000000
50% 39.000000 0.000000 550.000000 0.000000 0.000000 15.000000 255.000000 2.000000 -1.000000 0.000000 0.000000
75% 49.000000 0.000000 1708.000000 1.000000 0.000000 22.000000 496.000000 3.000000 20.750000 1.000000 1.000000
max 95.000000 1.000000 81204.000000 1.000000 1.000000 31.000000 3881.000000 63.000000 854.000000 58.000000 1.000000

Save the original dataset for reference before converting the numeric age and balance columns to continuous data for any data reduction/modelling purposes.

In [9]:
df_original = df.copy()
df_original.head(3)
Out[9]:
age job marital education credit_default balance housing_loan personal_loan contact_mode day month duration campaign days_since_last_contact previous poutcome deposit
0 59 admin. married secondary 0 2343 1 0 unknown 5 may 1042 1 -1 0 unknown 1
1 56 admin. married secondary 0 45 0 0 unknown 5 may 1467 1 -1 0 unknown 1
2 41 technician married secondary 0 1270 1 0 unknown 5 may 1389 1 -1 0 unknown 1
In [10]:
# Convert discrete age to continuous

bins = [0, 10, 20, 30, 40, 50, 60, 70, 80, 90, 100]  # Define your age intervals here

#labels for the age intervals
labels = ['0-10','11-20' ,'21-30', '31-40', '41-50', '51-60', '61-70', '71-80', '81-90', '91-100']

#Create a new column with age intervals
df['age_range'] = pd.cut(df['age'], bins=bins, labels=labels, right=False)

# Display the transformed DataFrame
df = df[['age_range', 'job', 'marital', 'education', 'credit_default', 'balance',
       'housing_loan', 'personal_loan', 'contact_mode', 'day', 'month',
       'duration', 'campaign', 'days_since_last_contact', 'previous',
       'poutcome', 'deposit']]
In [11]:
# Convert the discrete Balance to continuous

# We will be using binning to identify the right no of bins that can be used in this case.

# Create a histogram and observe to finalize the number of bins
plt.hist(df['balance'], bins='auto', edgecolor='k')
plt.xlabel('Value')
plt.ylabel('Frequency')
plt.title('Histogram with Auto Binning')
plt.show()

#bins = [500,1000,1500,2000]

We can see that the data is ranging over a huge range but most of the data is concentrated between the smaller ranges. Thus, we woud use bins of varied sizes to accomodate all the variance in data

In [12]:
bins = [0,10,5000,15000,45000,90000]

#labels for the age ranges
labels = ['0-10','11-5000','5001-15000','15001-45000','45001-90000']
df['balance'] = df.apply(lambda x: 0 if x['balance'] < 0 else x['balance'],axis=1)
#Create a new column with age ranges
df['balance_range'] = pd.cut(df['balance'], bins=bins, labels=labels, right=False)
df[df['balance_range'].isna()==True]
#Display the transformed DataFrame
df = df[['age_range', 'job', 'marital', 'education', 'credit_default', 'balance_range',
       'housing_loan', 'personal_loan', 'contact_mode', 'day', 'month',
       'duration', 'campaign', 'days_since_last_contact', 'previous',
       'poutcome', 'deposit']]

4. Data Description¶

Index Feature Description Type Range
1 age Age of the customer Integer 18-95
2 job Job of the customer Categorical admin, technician, unemployed.. etc
3 marital Marital Status Categorical Married, Single, divorced, unknown.
4 education Education level of the customer Categorical High School, professional, illiterate etc.
5 credit_default Customer Credit default status Binary Yes/No
6 balance Customer's Average Yearly Account Balance Numeric -6487 - 81204
7 housing_loan Wheather customer has a housing loan (Yes/No) Binary 0/1
8 personal_loan Whether the customer has any personal loan (Yes/No) Binary 0/1
9 contact_mode Mode of promotion contact Categorical Cellular/Telephonic
10 day Day of contact month Numeric 1-31
11 month Month of contact Categorical Jan-Dec
12 duration last contact duration in seconds Numeric 2 - 3881
13 campaign Number of contacts to the customer during the campaign Numeric 1 - 63
14 days_since_last_contact Days passed since last contacted - (-1 -> Inidcates no previous contact) Numeric -1 - 854
15 previous Whether the customer previously accpeted the promotion (Yes/No) Binary 0/1
16 poutcome Outcome of the previous marketing campaign Categorical 'Unknown','Others','Failure,'Success'
17 deposit Whether the customer made a deposit (Yes/No) Binary 0/1 }

5. Visualizations¶

As discussed, our aim is to have the Customer start a deposit in the bank. For that, the marketing team aim to select and target a specific group of customers. To filter out the best set of customers to be chosen, we try the following visualizations.

5.1 What is the Deposit Rate distribution by Age-Group in the dataset?¶

In [13]:
df_age = df.groupby(by='age_range')['deposit'].agg(['mean','count']).reset_index()
df_age

# Plot a bar chart showing the deposit rates in every range

sns.set(style='whitegrid')
plt.figure(figsize=(10,6))
ax1 = sns.barplot(x='age_range',y='mean',data=df_age,color='skyblue',label='deposit ratio')
plt.xlabel('Age Range')
plt.ylabel('Deposit Rate')
plt.title('Deposit rate & customer count by age range')
plt.xticks(rotation=45)

# Add a second y-axis addrssing the count of customers
ax2 = ax1.twinx()
sns.lineplot(x='age_range',y='count',data=df_age,ax=ax2,marker='o',color='orange',label='Customer count')

ax1.set_ylabel('Deposit Rate')
ax2.set_ylabel('Customer Count')

# Display legends
ax1.legend(loc='upper left')
ax2.legend(loc='upper right')

plt.tight_layout()
plt.show()

Deposit ratio & customer count by age range demonstrates the positve deposits happening in every age bucket. As a whole we can see that the age bucke 91-100 has best deposit rate but has a very nominal count on which no conclusions can be drawn.

We must refer to both the deposit rate and the count of customer to come a solid conclusion. We can observe that the age bucket 21-30 has the best deposits to count ratio and the age bucket 31-40 has the worst depsoits to count ratio. Thus, we can say that the age group between 21-30 have invested in deposits the more through these promotion campaigns and the age group between 31-40 is investing less.

Dig deeper into 31-40 to investigate the issue.

In [14]:
df[df['age_range'] == '31-40'].groupby(by=['job','balance_range']).job.agg(['count']).reset_index()
Out[14]:
job balance_range count
0 admin. 0-10 88
1 admin. 11-5000 440
2 admin. 5001-15000 17
3 admin. 15001-45000 2
4 admin. 45001-90000 0
5 blue-collar 0-10 134
6 blue-collar 11-5000 582
7 blue-collar 5001-15000 29
8 blue-collar 15001-45000 5
9 blue-collar 45001-90000 0
10 entrepreneur 0-10 24
11 entrepreneur 11-5000 94
12 entrepreneur 5001-15000 7
13 entrepreneur 15001-45000 0
14 entrepreneur 45001-90000 0
15 housemaid 0-10 8
16 housemaid 11-5000 52
17 housemaid 5001-15000 4
18 housemaid 15001-45000 1
19 housemaid 45001-90000 0
20 management 0-10 158
21 management 11-5000 934
22 management 5001-15000 85
23 management 15001-45000 6
24 management 45001-90000 0
25 retired 0-10 0
26 retired 11-5000 3
27 retired 5001-15000 0
28 retired 15001-45000 0
29 retired 45001-90000 0
30 self-employed 0-10 15
31 self-employed 11-5000 140
32 self-employed 5001-15000 9
33 self-employed 15001-45000 2
34 self-employed 45001-90000 0
35 services 0-10 78
36 services 11-5000 323
37 services 5001-15000 16
38 services 15001-45000 1
39 services 45001-90000 0
40 student 0-10 4
41 student 11-5000 57
42 student 5001-15000 4
43 student 15001-45000 0
44 student 45001-90000 0
45 technician 0-10 123
46 technician 11-5000 680
47 technician 5001-15000 50
48 technician 15001-45000 8
49 technician 45001-90000 1
50 unemployed 0-10 13
51 unemployed 11-5000 102
52 unemployed 5001-15000 7
53 unemployed 15001-45000 0
54 unemployed 45001-90000 0
55 unknown 0-10 5
56 unknown 11-5000 6
57 unknown 5001-15000 1
58 unknown 15001-45000 0
59 unknown 45001-90000 0
In [15]:
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

df_fil = df[df['age_range'] == '31-40']
df_fil = df_fil[['job','balance_range','deposit']]

# Pivot the data to create a table with Balance_Range as rows, Job as columns, and count of deposits as values
pivot_table = df_fil.pivot_table(index='balance_range', columns='job', values='deposit', aggfunc='mean', fill_value=0) * 100

# Create a heatmap
plt.figure(figsize=(10, 6))
sns.heatmap(pivot_table, annot=True, fmt='.1f', cmap='YlGnBu')
plt.title('Deposits by Balance Range and Job')
plt.xlabel('Job')
plt.ylabel('Balance Range')

plt.show()

The heatmap explains that the for the age group 31-40, high percentage of customer working across different job roles invest in the schemes. It mainly comes upto the balance that they have in their accounts.

Customers having a balance range of 11-5000 and 5001 - 15000 invest the most the in the schemes. The other balance groups don't show a consistent investment rate through all the job groups. Thus, it can be seen that the combination balance and job plays a major role for this age group to derive the percentage of deposits.

5.2. Identify what effect did having a credit default or loan have on the customers who wanted to deposit?¶

In [16]:
# Create subplots for each factor
fig, axes = plt.subplots(nrows=1, ncols=3, figsize=(15, 5))

# Plot Credit_Default vs. Deposit
sns.countplot(x='credit_default', hue='deposit', data=df, ax=axes[0])
axes[0].set_title('Credit Default vs Deposit')

# Plot Personal_Loan vs. Deposit
sns.countplot(x='personal_loan', hue='deposit', data=df, ax=axes[1])
axes[1].set_title('Personal loan vs Deposit')

# Plot Car_Loan vs. Deposit
sns.countplot(x='housing_loan', hue='deposit', data=df, ax=axes[2])
axes[2].set_title('Housing loan vs Deposit')

# Add legend to each subplot
for ax in axes:
    ax.legend(title='Deposit', labels=['No', 'Yes'])

# Adjust spacing between subplots
plt.tight_layout()

# Show the plots
plt.show()

Looking at the three comparisons, we can observe that:

  1. Cases with credit default hardly exist. Also, the customer count with no credit default is almost equally divided between the ones opting in and out of the deposit sceheme. Thus, We can understand that this factor is not influencing their decision.
  2. In cases with Personal loan, we observe that for the limited number of rows with a personal loan, most of them don't prefer to start a deposit scheme. But at the same time, the count of customers not having a personal are deposting is almost equal ot the one who are not. Thus, this does not look to be much of a concern when taking the decision to start a deposit.
  3. However, in the cases where the customers are having a Housing loan are clearly avoiding starting a deposit scheme. They prefer to stay out of it when compared to the ones not having any housing loan.

Thus, we can conclude that Housing loan is a parameter that the customers consider when they think on investing in a deposit scheme.

5.3. What is the deposit rate based on the educational background of the customer?¶

In [17]:
df_education = df.groupby(by='education')['deposit'].agg(['mean','count']).reset_index()
df_education

# Plot a bar chart showing the deposit rates in every range

sns.set(style='whitegrid')
plt.figure(figsize=(10,6))
ax1 = sns.barplot(x='education',y='mean',data=df_education,color='lightgreen',label='deposit ratio')
plt.xlabel('Education')
plt.ylabel('Deposit Rate')
plt.title('Deposit rate & customer count by Education')
plt.xticks(rotation=45)

# Add a second y-axis addrssing the count of customers
ax2 = ax1.twinx()
sns.lineplot(x='education',y='count',data=df_education,ax=ax2,marker='o',color='blue',label='Customer count')

ax1.set_ylabel('Deposit Rate')
ax2.set_ylabel('Customer Count')

# Display legends
ax1.legend(loc='upper left')
ax2.legend(loc='upper right')

plt.tight_layout()
plt.show()

From the Deposit Rate and Customer Count graph, we can understand that people who have completed either secondary or tertiary levels of education, have a higher preference on option for a deposit scheme as compared to the customers with just primary level of education.

Thus, when strategizing the next deposit cycle, the customers belonging to tertiary followed by secondary background can be targetted a little more.

5.4. What is the effect of time spent when explaining the deposit schemes and the followup done on the deposit status of the customer?¶

In [18]:
# Select the columns for boxplot visualization
selected_columns = ['duration', 'campaign']

# Create subplots for each selected variable
plt.figure(figsize=(12, 5))

for i, col in enumerate(selected_columns, 1):
    plt.subplot(1, 2, i)
    sns.boxplot(data=df, x='deposit', y=col, palette='Set2')
    plt.title(f'Box Plot for {col}')
    plt.xlabel('Deposit Status')
    plt.ylabel(col)

plt.tight_layout()
plt.show()

The box plot for duration indicates that in the cases where people deposited had a significant higher call duration which ultimately shows their interest in the scheme. The upperquartile for the YES cases wider the lower quartile for the same. Thus, more the engagement of the customer in the call, higher the chances that customer would invest in the deposit

The box plot for campaign indicates that there isn't any significant effect of contacting the same set of customers multiple times. The quadrants are almost equally distributed for the higher and lower quartiles indicating that the there isn't any significantt changes with the increase in the contact attempts.

5.5. Derive the Correlation between the columns in the dataset¶

In [19]:
df_corrleation = df[['credit_default','housing_loan','personal_loan','day','duration','campaign','days_since_last_contact','previous']]
corr_matrix = df_corrleation.corr()

plt.figure(figsize = (10,6))
sns.heatmap(corr_matrix,annot=True,cmap='coolwarm',linewidth=0.5)

#Add labels

plt.title('Correlation heatmap')
plt.xticks(rotation=45)
plt.tight_layout()

plt.show()

From the heatmap, we can observe that the correlation between the columns in the dataset is not much of a concern. Each column appears to be more or less independent of the other with the highest correlation factor of 51% observed beteen number of previous contacts and total duration of the contact call

There exists positive correlation between credit default, personal loan, housing loan, contact day of month and campaign contacts have a positive correlation with each other, where as previously contacted, duration of contact and days since last contact have a negaive correlation with each other.

6. Dimension Reducationality¶

We use dimension reductionality to primarily understand if we can compute the solution to the problem even after squeezing in the no of dimensions/features involved in the problem. We must understand the trade off done on features to improve the computational speed before using this technique

6.1 PCA¶

In [20]:
#Select the numerical dependent data where dimensions can be reduced. 

df_dimensions = df_original[['age','balance','campaign', 'days_since_last_contact', 'previous','duration']]
df_dimensions.head(3)
Out[20]:
age balance campaign days_since_last_contact previous duration
0 59 2343 1 -1 0 1042
1 56 45 1 -1 0 1467
2 41 1270 1 -1 0 1389
In [21]:
## Visualizing the original data before PCA

from sklearn.preprocessing import StandardScaler

numeric = df_dimensions.columns

#Visualize the original data
plt.figure(figsize=(12, 6))
for i, feature in enumerate(numeric, 1):
    plt.subplot(2, 3, i)
    plt.scatter(df_original[feature], df_original['deposit'], alpha=0.5)
    plt.title(f'Original {feature} vs Deposit Status')
    plt.xlabel(feature)
plt.tight_layout()
plt.show()

From the above plots, we can understand:

  1. There are data almost evenly distributed across all ages until 80 for both cases of deposits.
  2. Customer balance is heavily concentrated for values up 40k and has an outlier for point for balance >80k in the case where a deposit was made.
  3. For Original campaign, when the deposits are no, there are a some outlier points beyond 60.
  4. The days since last contact is heavily distributed in first half of number of days. For the cases where customer made deposits, there is significant number in the second half as well.
  5. For number of previous contacts, for cases where deposits where made, there are some extreme outlier points. Mostly, the datapoints are concentrated in the first of half.
  6. The duration in calls have higher range when the deposits were made as compared to when they were not.
In [22]:
#Scale the data 
scaler = StandardScaler()
scaled_data= scaler.fit_transform(df_dimensions)
df_scaled = pd.DataFrame(data=scaled_data, columns=numeric)
In [23]:
# Apply PCA
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler
pca = PCA(n_components=3)  # We'll keep the first 3 principal components for visualization
X_pca = pca.fit_transform(scaled_data)
In [24]:
import seaborn as sns
cmap = sns.set(style="darkgrid") 

# this function definition just formats the weights into readable strings
def get_feature_names_from_weights(weights, names):
    tmp_array = []
    for comp in weights:
        tmp_string = ''
        for fidx,f in enumerate(names):
            if fidx>0 and comp[fidx]>=0:
                tmp_string+='+'
            tmp_string += '%.2f*%s ' % (comp[fidx],f[:-5])
        tmp_array.append(tmp_string)
    return tmp_array
  
plt.style.use('default')

# Analyse to see how the components looks
pca_weight_strings = get_feature_names_from_weights(pca.components_, df_scaled.columns) 

# transformed output dataframe
df_pca = pd.DataFrame(X_pca,columns=[pca_weight_strings])

from matplotlib.pyplot import scatter

# Scatter plots to observe the reduced dimensions
color_dict = {0: 'green', 1: 'blue'}
point_colors = [color_dict[status] for status in df['deposit']]
ax = scatter(X_pca[:,0], X_pca[:,1],c=point_colors,  cmap=cmap)
plt.xlabel(pca_weight_strings[0]) 
plt.ylabel(pca_weight_strings[1])
plt.title('Scatter plot for reduced dimensions')
plt.legend('')
plt.show()

From the above Scatter plot for reduced dimension graph, we can see both the cases of deposits being distinguished.

However there is a good proportion of overlap in the features for both the cases of deposits. Thus, making a clear distinction at this point in a reduced dimension is quite difficult. We could test alternate methods or add more classifiable features in order to make this clear distinction.

Post PCA analysis to understand the dimensional space¶

In [25]:
#Visualize the variance tradeoff involved when reducing each dimensions
import numpy as np

def plot_explained_variance(pca):
    import plotly
    from plotly.graph_objs import Scatter, Marker, Layout, XAxis, YAxis, Bar, Line
    plotly.offline.init_notebook_mode() # run at the start of every notebook
    
    explained_var = pca.explained_variance_ratio_
    cum_var_exp = np.cumsum(explained_var)
    
    plotly.offline.iplot({
        "data": [Bar(y=explained_var, name='individual explained variance'),
                 Scatter(y=cum_var_exp, name='cumulative explained variance')
            ],
        "layout": Layout(xaxis=XAxis(title='Principal components'), yaxis=YAxis(title='Explained variance ratio'))
    })
        

pca = PCA(n_components=6)
X_pca = pca.fit(scaled_data)
plot_explained_variance(pca)
C:\Users\Prashant\anaconda3\lib\site-packages\plotly\graph_objs\_deprecations.py:544: DeprecationWarning:

plotly.graph_objs.XAxis is deprecated.
Please replace it with one of the following more specific types
  - plotly.graph_objs.layout.XAxis
  - plotly.graph_objs.layout.scene.XAxis


C:\Users\Prashant\anaconda3\lib\site-packages\plotly\graph_objs\_deprecations.py:572: DeprecationWarning:

plotly.graph_objs.YAxis is deprecated.
Please replace it with one of the following more specific types
  - plotly.graph_objs.layout.YAxis
  - plotly.graph_objs.layout.scene.YAxis


We can see and interpret from above plot(Fig2) that only 45% variance can be convered by the 2 dimensionas and around 65% is covered when reduced to 3.

Also, since 2nd, 3rd and 4th dimensions display almost equal contribution to the variance, ignoring any of these dimensions might result in missing out of important features. Though, we can classify the information, that might not be of the best accuracy.

Given the use case, it is to identify and priortize the customer for deposit schemes, we can attempt on this model and evaluate the performance on this data to analyse the effect.

In [26]:
#We'll see and visualize the calculated eigen vectors
eigenvectors = pca.components_
eigenvectors
Out[26]:
array([[ 0.04714801,  0.07496487, -0.19556165,  0.69376745,  0.68552947,
        -0.05146893],
       [ 0.67745748,  0.69351567, -0.09534339, -0.07966926, -0.05365854,
         0.20437853],
       [ 0.22780032,  0.08849141,  0.62532506,  0.01400722,  0.08359068,
        -0.73624809],
       [-0.12833079,  0.06771689,  0.74464645,  0.08676539,  0.1727498 ,
         0.62215478],
       [-0.68557671,  0.70765288, -0.03691687, -0.02060104, -0.03211298,
        -0.16246066],
       [-0.02124177, -0.01507903, -0.07597802, -0.71006387,  0.69950624,
        -0.00700582]])
In [27]:
# Plot a heatmap to understand the correaltion and dependence of dimensions to vectors to decode the reduction process
plt.figure(figsize=(10, 6))
plt.imshow(pca.components_, cmap='viridis', aspect='auto')
plt.colorbar(label='Eigenvector Value')
plt.xticks(np.arange(6), [f'Eigenvector {i+1}' for i in range(6)])
plt.xlabel('Eigenvectors')
plt.ylabel('Dimensions')
plt.title('Heatmap of Eigenvector Values')
plt.show()

The heatmap for eigen vectors helps us interpret the influence of the dimensional space. The Eigen Vectors 1,4 & 6 have a high negative correlation in the 5th, 6th and 3rd dimensions respectively whereas eigen vector 2 has a high postive correlation in the 2nd & 5th dimensional space. The other dimensions have relatively equal distribution of the vectors involved in the reduction of the dimensional space.

6.2 Exploring UMAP¶

We tried exploring on how UMAP might behave when given the same dataset as an input to understand the process that UMAP follows:

In [28]:
# get the numeric columns
df_umap = df_original[['age','balance','campaign', 'days_since_last_contact', 'previous','duration']]
In [29]:
# Scale the data
scaler = StandardScaler()
scaled_data = scaler.fit_transform(df_umap)
In [30]:
# Model with n_components
umap_model = umap.UMAP(n_neighbors=5, n_components=2)  # Adjust parameters as needed
umap_result = umap_model.fit_transform(scaled_data)
In [31]:
# Map the output datapoints to their respective colors to understand the reduced distribution
color_dict = {0: 'green', 1: 'blue'}
point_colors = [color_dict[status] for status in df_original['deposit']]

plt.scatter(umap_result[:, 0], umap_result[:, 1],c=point_colors ,cmap='viridis')
plt.title('UMAP Projection')
plt.xlabel('UMAP Dimension 1')
plt.ylabel('UMAP Dimension 2')
plt.legend()
plt.show()
No artists with labels found to put in legend.  Note that artists whose label start with an underscore are ignored when legend() is called with no argument.

The UMAP Projection indicates that the datapoints in the dataset share similar attributes. They share certain similar features in their original dimensional space. Thus, it shows some of the important datasets were lost during the reduction process. However, we still can observe some slight pattern which can be useful to this usecase in the classification process.

References¶

  1. [UCI]. Marketing Dataset. Retrieved from https://archive.ics.uci.edu/dataset/222/bank+marketing
  2. [Kaggle]. Marketing Dataset. Retrieved from https://www.kaggle.com/datasets/janiobachmann/bank-marketing-dataset
  3. Moro, S., Cortez, P. & Rita, P. (2014). A data-driven approach to predict the success of bank telemarketing. Decision Support Systems. 62, 22-31 https://dx.doi.org/10.1016/j.dss.2014.03.001